Combine two data sets | Variable names are different

Rodrigo M. Carrillo Larco

Join Date: Sep 2015

Posts: 1
#1

Combine two data sets | Variable names are different

04 Sep 2015, 11:42

Dear Stata listers, I have to merge two data sets. Both have a unique identifier. Unfortunately, most (if not all) of the other variables have different names. For example, in one I would have pdead (participant dead) and in the other it is palive (participant alive). Five years ago somebody posted a similar question: http://www.stata.com/statalist/archi.../msg00394.html
I could check the variables one by one. This will be time consuming and tedious. If there is any way to merge these two data sets easier? Thanks Best

st: combine two data sets whose variable names are all different

http://www.stata.com
Tags: None
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

04 Sep 2015, 12:17

Rodrigo,

Unfortunately the answer (or non-answer) now is the same as it was then. Stata cannot read minds so there is no way for it to know what variables are the same (or similar) in two different data sets unless you give the variables the same name. What did you have in mind?

Regards,
Joe
Comment
Giovanni Colitti

Join Date: Feb 2016

Posts: 41
#3

28 Aug 2018, 12:50

Hi Rodrigo and Joe,

Why can't Stata search through the datasets that are to be merged and list unique observations of all variables and recommend "merge by" or "join by" variables based on those variables that have the same unique identifiers? This would likely take a long time and, with large datasets, not be preferable to manually going through and renaming "merge by" variables to match across datasets. Alternatively, is there a way to tell Stata when merging that you want the variable ID1 in dataset #1 to be matched with the variable ID2 in dataset #2?--So, ID1 and ID2 are the same unique identifiers just different variable names.

Gio
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

29 Aug 2018, 08:40

Here, an excerpt from the Stata Manual concerning - merge - command:

varlist specifies variables common to
both datasets that together uniquely identify single observations in both datasets. For instance, suppose
you have a dataset of customer information, called customer.dta, and have a second dataset of other
information about roughly the same customers, called other.dta. Suppose further that both datasets
identify individuals by using the pid variable, and there is only one observation per individual in
each dataset. You would merge the two datasets by typing
. use customer
. merge 1:1 pid using other

That said, if you are sure every row of one dataset corresponds to the other, you may try this command:

Code:

. */ One-to-one merge by observation merge 1:1 _n using filename

Last edited by Marcos Almeida; 29 Aug 2018, 08:45.

Best regards,

Marcos
Comment
Giovanni Colitti

Join Date: Feb 2016

Posts: 41
#5

07 Dec 2018, 10:04

I think there may be a solution to this problem, provided I'm not misinterpreting the issue.

If the OP's problem is that the unique identifier used to merge the two data sets is named differently in the using data set, then I think one can use a combination of -cfvars- to obtain all variable names from the using data set and -mmerge- to loop merge attempts between the two data sets by all variabels in the using data set.

I'm imagining something like:

Code:

// Create using dataset sysuse auto , clear egen idfalse = seq() egen idtrue = group(make price) keep in 1/5 gen newvariable = runiform(0,1) drop make-foreign save auto2 , replace // Get list of variable names in using data set // Note: I don't think this is the best way to do this clear gen notarealvariable = 0 cfvars auto2.dta // Loop through using data set variable names and attemp merge // If not all observations from using dataset are matched, // then go to next varaible in the using dataset foreach x in `r(oneonly)' { sysuse auto , clear egen id = group(make price) capture mmerge id using auto2.dta , umatch(`x') capture local uobs = `r(uobs)' capture count if _merge == 3 capture local match = `r(N)' if `uobs' != . { assert `match' != `uobs' } }

If a complete match is found, execution will cease and -assert- will display "assertion is false".

So this seems like a very bad idea to begin with, but if the using data set has hundreds or thousands of variables--only one of which being a unique identifier that matches the master data set--then I can imagine how it could save some time. It's important to note that this assumes you know the unique identifier in the master data set and that you are only interested in a complete match between the master and using data set (i.e. all observations from using are matched to master). Also note that -capture- is added because if the merge-by variable type does not match, -mmerge- will throw an error. Finally, if in the using data set there are variables which are not unique identifiers but all values match values of the unique id in the master data set then this method fails.

Both -mmerge- and -cfvars- are available on SSC. As a side note, for those just wanting to merge two data sets with ids that are named differently--without going into the using data set or renaming the ids in the master data set--I've found -mmerge- to be very useful.

Last edited by Giovanni Colitti; 07 Dec 2018, 10:42.
Comment

Announcement

Combine two data sets | Variable names are different

Comment

Comment

Comment

Comment